Bio 724: More ggplot2
Libraries
library(tidyverse)
library(RColorBrewer)
# ggplot2 extensions
library(ggrepel)
library(GGally)
library(ggmagnify)
# data
library(palmerpenguins)
Depicting 3D data using color
ggplot2 has no built-in support for 3D graphs; instead it emphasizes the use of color, or other aesthetics, to depict more than two dimensions.
Contour plots and heat maps are among the most common ways to represent 3D data in ggplot2.
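As a minimal sketch of the idea, the scatter plot below maps a third variable to color. The choice of penguins columns and of the viridis scale is an illustrative assumption, not something prescribed by these notes.

```r
library(ggplot2)
library(palmerpenguins)

# x and y carry two measurements; color carries a third (body mass)
p_color <- ggplot(penguins,
                  aes(x = bill_length_mm,
                      y = bill_depth_mm,
                      color = body_mass_g)) +
  geom_point(na.rm = TRUE) +   # drop rows with missing values without a warning
  scale_color_viridis_c()      # continuous color scale
p_color
```

Swapping `color` for `size` or `shape` (for a categorical variable) follows the same pattern.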
Contour plots
Contour plots are most useful when you want to represent a variable (say “z”) as a function of two other variables (x and y). Contour plots use nested lines (think contour lines on a topographic map) and optionally color to depict such relationships.
Take for example the mathematical function: \(z = \sin(x) + \cos(y)\). A 3D plot of this function looks like this (generated using Wolfram Alpha):
The code below draws a contour plot of the same function:
zfunc <- function(x, y) {
  sin(x) + cos(y)
}
# try some other functions!
# e.g. z = sin(sqrt(x^2 + y^2))
# create sets of points over which to evaluate zfunc
x <- seq(-2*pi, 2*pi, length.out = 100)
y <- seq(-2*pi, 2*pi, length.out = 100)
# expand_grid creates all pairwise combinations of x and y
xy <- expand_grid(x, y)
xyz <- mutate(xy, z = zfunc(x, y))
ggplot(xyz, aes(x = x, y = y, z = z)) +
  geom_contour_filled() + # try with geom_contour() instead
  coord_equal()
I think a diverging color scheme makes more sense here, since the data is centered around zero. I like the “RdBu” (red-blue) diverging color scheme for data like this, but reversed (so blue = low, red = high):
blue.red.color.scheme <- rev(brewer.pal(9, "RdBu"))
ggplot(xyz, aes(x = x, y = y, z = z)) +
  geom_contour_filled() +
  # direction = -1 reverses the color scheme direction
  scale_fill_brewer(palette = "RdBu", direction = -1) +
  coord_equal()
Heat maps
Heat maps are another way to depict 3D data with color. Typically the x, y axes represent a combination of a categorical variable and a continuous variable.
geom_tile (for small data) and geom_raster (for large data) are the standard ggplot2 functions for creating heat maps.
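For a small data set, a geom_tile heat map can be sketched as follows. The toy table (four hypothetical samples at five time points, filled with random values) is made up purely for illustration.

```r
library(tidyverse)

# hypothetical toy data: 4 samples measured at 5 time points
set.seed(1)
toy_heat <- expand_grid(sample = c("s1", "s2", "s3", "s4"),
                        time = seq(0, 40, by = 10)) |>
  mutate(value = rnorm(n()))

# one tile per (time, sample) pair, colored by value
p_tile <- ggplot(toy_heat, aes(x = time, y = sample, fill = value)) +
  geom_tile(color = "white") +   # white borders separate the cells
  scale_fill_distiller(palette = "RdBu", limits = c(-2.5, 2.5))
p_tile
```

geom_raster takes the same x, y, and fill aesthetics, so switching between the two is usually a one-word change when the tiles are all the same size.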
Here we create a heat map showing gene expression over time for about 700 genes that were identified as showing cell cycle oscillatory behavior by Spellman et al. 1998:
First load and filter the data:
yeast_long <- read_csv("spellman-long.csv")
cell_cycle <- read_csv("spellman-cell-cycle-list.txt")
# get just the cell cycle genes as given in the cdc28 exptl conditions
cdc28_cell_cycle <-
  yeast_long |>
  filter(gene %in% cell_cycle$Gene) |>
  filter(expt == "cdc28")
# Q: how would you do the first filter step with a join?
# Q: could you combine the two filters above?
Then draw the plot:
# let's explicitly generate a color scheme
color.scheme <- rev(brewer.pal(9, "PiYG"))
cdc28_cell_cycle |>
  ggplot(aes(x = time, y = gene, fill = expression)) +
  geom_raster() +
  scale_fill_gradientn(limits = c(-2.5, 2.5), colors = color.scheme)
The real power of heat maps becomes apparent when you rearrange the rows of the heat map to emphasize patterns of interest. For example, let’s recreate that heat map with the genes sorted by the time of their maximal expression. This is one way to identify genes that reach their peak expression at similar times, which is one criterion one might use to identify genes acting in concert.
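Before applying it to the real data, the ordering step can be sketched on a toy table. The three genes and their values below are hypothetical, and this sketch looks up the peak time directly rather than the row index.

```r
library(dplyr)
library(tibble)

# hypothetical expression values for three genes at three time points
toy <- tribble(
  ~gene, ~time, ~expression,
  "A",    0,     0.1,
  "A",   10,     0.9,
  "A",   20,     0.2,
  "B",    0,     1.2,
  "B",   10,     0.3,
  "B",   20,    -0.5,
  "C",    0,    -0.4,
  "C",   10,     0.0,
  "C",   20,     0.8
)

# for each gene, find the time at which expression is highest,
# then sort genes from earliest to latest peak
toy_peaks <- toy |>
  group_by(gene) |>
  summarize(peak_time = time[which.max(expression)]) |>
  arrange(peak_time)

toy_peaks$gene
```

Here gene B peaks first (time 0), then A (time 10), then C (time 20), so the resulting order is B, A, C.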
peak.expression <-
  cdc28_cell_cycle |>
  group_by(gene) |>
  # which.max gives the row index of the maximum within each gene; this
  # tracks peak time as long as rows are ordered by time within each gene
  summarize(peak = which.max(expression)) |>
  arrange(peak)
# the y axis is drawn from the bottom row up, whereas we want to depict the
# earliest peaking genes at the top of the figure
gene.ordering <- rev(peak.expression$gene)
better_heat <-
  cdc28_cell_cycle |>
  ggplot(aes(x = time, y = gene, fill = expression)) +
  geom_raster() +
  scale_fill_gradientn(limits = c(-2.5, 2.5), colors = color.scheme) +
  scale_y_discrete(limits = gene.ordering) +
  labs(x = "Time (mins)",
       y = "Genes",
       title = "Cell cycle genes, CDC28 experiment",
       subtitle = "Sorted by time of peak expression") +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
better_heat
ggsave("cdc28-heatmap.png", better_heat, width = 4, height = 12, dpi = 600)
Annotations
ggplot2 extensions
There is an “official” gallery for ggplot2 extensions, which can be found at the following URL: https://exts.ggplot2.tidyverse.org/index.html
We’ve already used some ggplot2 extensions, such as patchwork. Today we’ll introduce a few more:
- ggrepel – for annotation / labeling; https://ggrepel.slowkow.com/
- ggmagnify – magnified insets; https://github.com/hughjonesd/ggmagnify – see also https://ggforce.data-imaginist.com/ for a facet-style “zoom”
Basic annotations
Highlighting via layers
A simple way to highlight data of interest is by layering geoms.
Here for example we highlight two points in a data set by stacking two geom_point layers:
# highlight biggest and smallest penguins by body mass
penguins |>
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
  # larger orange points are drawn first, so they show as halos
  geom_point(
    data = filter(penguins |> drop_na(body_mass_g),
                  body_mass_g >= max(body_mass_g) | body_mass_g <= min(body_mass_g)),
    color = "orange",
    size = 3) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
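The warning most likely arises because the second geom_point() layer still receives the full penguins table, which includes two rows with missing measurements. One possible variation, sketched here with the hypothetical helper names penguins_complete and extremes, drops those rows once up front and picks the extremes with slice_max and slice_min:

```r
library(tidyverse)
library(palmerpenguins)

# drop incomplete rows once, so that no layer has to discard them
penguins_complete <- drop_na(penguins, body_mass_g, flipper_length_mm)

# heaviest and lightest birds (ties, if any, are all kept)
extremes <- bind_rows(
  slice_max(penguins_complete, body_mass_g, n = 1),
  slice_min(penguins_complete, body_mass_g, n = 1)
)

ggplot(penguins_complete, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(data = extremes, color = "orange", size = 3) +
  geom_point()
```

Keeping the highlighted rows in their own data frame also makes them easy to reuse in later layers, for example for labels.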
Annotating
ggplot2 has an annotate function for drawing text, line segments, curves, etc., to explicitly call out features in plots.
From the annotate help: “This function adds geoms to a plot, but unlike a typical geom function, the properties of the geoms are not mapped from variables of a data frame, but are instead passed in as vectors”
Here’s an updated version of the figure above to which we’ve added text and arrows:
penguins |>
  ggplot(aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(
    data = filter(penguins |> drop_na(body_mass_g),
                  body_mass_g >= max(body_mass_g) | body_mass_g <= min(body_mass_g)),
    color = "orange",
    size = 3) +
  geom_point() +
  # arrow and label for the heaviest penguin
  annotate(geom = "curve",
           x = 6000, y = 200, xend = 6300, yend = 220,
           curvature = .3, arrow = arrow(length = unit(2, "mm"))) +
  annotate(geom = "text",
           x = 6000, y = 198,
           label = "Kowalski", hjust = "center") +
  # arrow and label for the lightest penguin
  annotate(geom = "curve",
           x = 2800, y = 210, xend = 2690, yend = 193,
           curvature = .3, arrow = arrow(length = unit(2, "mm"))) +
  annotate(geom = "text",
           x = 2800, y = 212,
           label = "Pingu", hjust = "center")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
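Because annotate takes its values as vectors, several annotations of the same kind can be drawn with a single call. The sketch below reuses the two label positions from the figure above; the object name p_labels is just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_labels <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(na.rm = TRUE) +
  # one annotate() call, two labels: x, y and label are parallel vectors
  annotate(geom = "text",
           x = c(6000, 2800),
           y = c(198, 212),
           label = c("Kowalski", "Pingu"),
           hjust = "center")
p_labels
```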
Magnifying regions of a plot
ggmagnify provides a geom for “zooming in” on a region of a plot.
z <- ggplot(penguins,
            aes(x = flipper_length_mm,
                y = bill_length_mm,
                color = species,
                shape = sex)) +
  geom_point(size = 2, alpha = 0.6) +
  theme_classic() +
  geom_magnify(from = c(190, 195, 45, 50), to = c(220, 235, 25, 40)) +
  coord_cartesian(xlim = c(170, 240), ylim = c(20, 60)) +
  theme(legend.position = "top")
z
Labeling points
Labeling points with text is another form of annotation. The function geom_text supports basic labeling, but as we’ll see below it has some limitations. A ggplot2 extension package called ggrepel extends the functionality of geom_text, allowing for nicer automatic labeling of points.
Example data: Microbescope
The website Information is Beautiful has lots of interesting data visualizations and “data stories”. One of the data stories of biomedical interest is the MicrobeScope, which presents data on the contagiousness of various microbes and its relationship to deadliness, fatality rate, media coverage, etc.
I’ve made a cleaned up version of the data used to produce the figure above available to you as a CSV file at the following URL:
- microbescope_clean.csv – this data was derived from the “original data” sheet in the data provided at http://bit.ly/KIB_Microbescope
microbescope <- read_csv("microbescope_clean.csv")
Basic geom_text plot
ggplot(microbescope) +
  aes(x = average_basic_reproductive_rate,
      y = case_fatality_rate,
      shape = pathogen_type,
      color = primary_mode_of_transmission) +
  geom_point() +
  geom_text(aes(label = disease),
            size = 6,
            size.unit = "pt",
            hjust = 0, nudge_y = 0.15) +
  coord_cartesian(xlim = c(0, 20))
Extended version using ggrepel
my_colors <- c("#335060", "#9d9353", "#e13698", "#37251e",
               "#81aea3", "#ff8106", "#888888", "firebrick")
microbescope |>
  drop_na() |>
  filter(average_basic_reproductive_rate < 20) |>
  ggplot(aes(x = average_basic_reproductive_rate,
             y = case_fatality_rate,
             shape = pathogen_type,
             color = primary_mode_of_transmission)) +
  # draw background rectangles
  geom_rect(aes(ymin = 0, ymax = 1, xmin = -1, xmax = 21),
            alpha = 0.003, color = '#D3D3D3', linetype = "dashed") +
  geom_rect(aes(ymin = 20, ymax = 50, xmin = -1, xmax = 21),
            alpha = 0.003, color = '#D3D3D3', linetype = "dashed") +
  # draw points
  geom_point(size = 4, alpha = 0.6) +
  scale_color_manual(values = my_colors) +
  # draw labels
  geom_text_repel(aes(label = disease),
                  size = 4,
                  force = 1,
                  force_pull = 2) +
  scale_y_sqrt(breaks = c(0, 0.1, 1, seq(10, 100, 10)),
               minor_breaks = NULL,
               sec.axis = dup_axis(name = "",
                                   breaks = c(0.2, 8, 35, 80),
                                   labels = c("Less deadly", "Quite deadly", "Deadly", "Extremely Deadly"))) +
  coord_cartesian(xlim = c(0, 20)) +
  labs(x = "Contagiousness: R0",
       y = "Deadliness: Case fatality rate (%)") +
  theme_bw() +
  # linewidth replaces the deprecated size argument of element_rect
  theme(legend.position = "top",
        legend.background = element_rect(linewidth = 0.5, colour = 'grey')) +
  guides(color = guide_legend(title = "Primary Mode of Transmission"),
         shape = guide_legend(title = "Pathogen Type"))
ggsave("microbescope.png", width = 14, height = 8)
Misc
- GGally – ggpairs; https://ggobi.github.io/ggally/articles/ggpairs.html
The ggpairs function from the package GGally is really useful for exploratory data analysis. It will generate a “matrix” of plots representing the relationships between all pairs of variables in a data frame. It handles continuous and categorical data well:
ggpairs(penguins)
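With many columns the full matrix gets crowded, so it can help to restrict the plot to a few variables and color by a grouping variable. The sketch below is one way to do that; the particular columns chosen and the object name pm are illustrative assumptions.

```r
library(ggplot2)
library(GGally)
library(palmerpenguins)

# restrict the matrix to the four numeric measurements, colored by species
pm <- ggpairs(penguins,
              columns = c("bill_length_mm", "bill_depth_mm",
                          "flipper_length_mm", "body_mass_g"),
              mapping = aes(color = species))
pm
```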